Transport services(运输层服务)
Transport services and protocols
provide logical communication(逻辑通信) between app processes running on different hosts
transport protocols run in end systems
- send side: breaks app messages into segments, passes to network layer
- rcv side: reassembles segments(报文段) into messages, passes to app layer
more than one transport protocol available to apps
- Internet: TCP and UDP
Transport vs. network layer(运输层和网络层的关系)
network layer: logical communication between hosts
transport layer: logical communication between processes
- relies on, enhances, network layer services
e.g.:
12 kids in Ann’s house sending letters to 12 kids in Bill’s house
- hosts = houses
- processes = kids
- app messages = letters in envelopes
- transport protocol = Ann and Bill who demux to in-house siblings
- network-layer protocol = postal service
Internet transport-layer protocols(因特网运输层协议)
- reliable, in-order delivery (TCP)
- congestion control
- flow control
- connection setup
unreliable, unordered delivery: UDP
- no-frills extension of “best-effort” IP(尽力而为)
- services not available:
- delay guarantees
- bandwidth guarantees
multiplexing and demultiplexing(多路复用与多路分解)
multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
demultiplexing at receiver: use header info to deliver received segments to correct socket
How demultiplexing works
host receives IP datagrams
- each datagram has source IP address, destination IP address
- each datagram carries one transport-layer segment
- each segment has source, destination port number
host uses IP addresses & port numbers to direct segment to appropriate socket
Connectionless demultiplexing(无连接的多路分解)
created socket has host-local port #:
```java
DatagramSocket mySocket1 = new DatagramSocket(12534);
```
when creating datagram to send into UDP socket, must specify
- destination IP address
- destination port #
when host receives UDP segment:
- checks destination port # in segment
- directs UDP segment to socket with that port #
IP datagrams with same dest. port #, but different source IP addresses and/or source port numbers will be directed to same socket at dest
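As an illustration of the point above, here is a minimal Java sketch (the port number and buffer size are arbitrary choices): one UDP socket bound to local port 12534 receives datagrams from any number of senders, and the source IP/port are only visible inside each received packet.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class UdpDemuxReceiver {
    public static void main(String[] args) throws Exception {
        // one socket, identified only by its host-local destination port
        DatagramSocket mySocket = new DatagramSocket(12534);
        byte[] buf = new byte[2048];
        while (true) {
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            mySocket.receive(pkt);  // segments from ANY source IP/port land here
            System.out.println("from " + pkt.getAddress() + ":" + pkt.getPort()
                    + " -> " + new String(pkt.getData(), 0, pkt.getLength()));
        }
    }
}
```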
Connection-oriented demux(面向连接的多路分解)
TCP socket identified by 4-tuple:
- source IP address
- source port number
- dest IP address
- dest port number
demux: receiver uses all four values to direct segment to appropriate socket
server host may support many simultaneous TCP sockets:
each socket identified by its own 4-tuple
web servers have different sockets for each connecting client
- non-persistent HTTP will have different socket for each request
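For contrast, a minimal Java sketch of the server side of connection-oriented demux (port 6789 is an arbitrary choice): every accept() hands back a distinct socket, and arriving segments are steered to it using the full 4-tuple.

```java
import java.net.ServerSocket;
import java.net.Socket;

public class TcpDemuxServer {
    public static void main(String[] args) throws Exception {
        ServerSocket welcomeSocket = new ServerSocket(6789);  // single listening socket
        while (true) {
            // each accepted connection gets its own socket; the OS demuxes
            // arriving segments to it using source IP/port + dest IP/port
            Socket conn = welcomeSocket.accept();
            System.out.println("4-tuple: " + conn.getInetAddress() + ":" + conn.getPort()
                    + " -> " + conn.getLocalAddress() + ":" + conn.getLocalPort());
            conn.close();
        }
    }
}
```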
connectionless transport: UDP(无连接运输: UDP)
UDP: User Datagram Protocol [RFC 768]
- finer application-level control over what data is sent, and when
- no connection establishment needed
- no connection state
- small packet header overhead
“no frills,” “bare bones” Internet transport protocol
- “best effort” service, UDP segments may be:
- lost
- delivered out-of-order to app
connectionless:
- no handshaking between UDP sender, receiver
- each UDP segment handled independently of others
UDP use:
- streaming multimedia apps (loss tolerant, rate sensitive)
- DNS
- SNMP
reliable transfer over UDP:
- add reliability at application layer
- application-specific error recovery
UDP: segment header(UDP报文段首部)
length: in bytes of UDP segment, including header
- why is there a UDP?
- no connection establishment (which can add delay)
- simple: no connection state at sender, receiver
- small header size
- no congestion control: UDP can blast away as fast as desired
UDP checksum(UDP检验和)
end-end principle(端到端原则)
Goal: detect “errors” (e.g., flipped bits) in transmitted segment
sender:
- treat segment contents, including header fields, as sequence of 16-bit integers
- checksum: addition (one’s complement sum) of segment contents
- sender puts checksum value into UDP checksum field
receiver:
- compute checksum of received segment
- check if computed checksum equals checksum field value:
- NO - error detected
- YES - no error detected. But maybe errors nonetheless? More later ….
e.g.: add two 16-bit integers
Note: when adding the numbers, a carry out of the most significant bit must be wrapped around and added back into the result (end-around carry)
UDP Pseudo-Header(UDP伪首部)
- Protocol – 17 (UDP)
e.g. Checksum calculation of a simple UDP user datagram
- All 0s: padding so that the data is a multiple of 16 bits
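A sketch of the sender-side computation in Java, showing both the end-around carry described above and the pseudo-header fields (protocol = 17, all-zeros padding). The addresses, ports, and payload are made-up example values, not a normative implementation.

```java
import java.nio.ByteBuffer;

public class UdpChecksum {

    // one's-complement sum of 16-bit words; a carry out of bit 15 is wrapped
    // around and added back in ("end-around carry")
    static int onesComplementSum(byte[] data) {
        int sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            int lo = (i + 1 < data.length) ? data[i + 1] & 0xFF : 0; // pad odd length with zeros
            sum += (hi << 8) | lo;
            if ((sum & 0x10000) != 0) sum = (sum & 0xFFFF) + 1;      // wrap the carry
        }
        return sum & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] srcIp = {(byte) 192, (byte) 168, 0, 1};      // made-up addresses
        byte[] dstIp = {10, 0, 0, 2};
        byte[] payload = "TESTING".getBytes();              // 7 bytes (odd length)
        int udpLength = 8 + payload.length;                 // UDP header + data

        ByteBuffer buf = ByteBuffer.allocate(12 + udpLength);
        // ---- pseudo-header ----
        buf.put(srcIp).put(dstIp);
        buf.put((byte) 0).put((byte) 17);                   // all zeros, protocol = 17 (UDP)
        buf.putShort((short) udpLength);
        // ---- UDP header (checksum field is 0 while computing) ----
        buf.putShort((short) 1087).putShort((short) 13);    // source port, destination port
        buf.putShort((short) udpLength).putShort((short) 0);
        // ---- data ----
        buf.put(payload);

        // sender stores the complement of the sum in the checksum field
        // (real UDP transmits 0xFFFF if the result happens to be 0; omitted here)
        int checksum = ~onesComplementSum(buf.array()) & 0xFFFF;
        System.out.printf("checksum = 0x%04X%n", checksum);
    }
}
```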
principles of reliable data transfer(可靠数据传输原理)
important in application, transport, link layers
top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
rdt: reliable data transfer
rdt_send(): called from above (e.g., by app.); passes data to deliver to receiver upper layer
udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
deliver_data(): called by rdt to deliver data to upper layer
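One way to picture these primitives is as a set of interfaces (a sketch only; the Packet shape and the method names, adapted from rdt_send/udt_send/rdt_rcv/deliver_data, are assumptions):

```java
public interface RdtInterfaces {

    // a packet as seen by the unreliable channel: seq # plus checksummed data
    final class Packet {
        final int seq;
        final byte[] checksumAndData;
        Packet(int seq, byte[] checksumAndData) {
            this.seq = seq;
            this.checksumAndData = checksumAndData;
        }
    }

    interface RdtSender {
        void rdtSend(byte[] data);      // called from above (the application)
        void rdtRcv(Packet pkt);        // ACK/NAK arriving from the channel
    }

    interface RdtReceiver {
        void rdtRcv(Packet pkt);        // data packet arriving on the rcv side
    }

    interface UnreliableChannel {
        void udtSend(Packet pkt);       // may corrupt, reorder, or lose the packet
    }

    interface UpperLayer {
        void deliverData(byte[] data);  // rdt hands data up to the application
    }
}
```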
rdt1.0: reliable transfer over a reliable channel(经完全可靠信道的可靠数据传输)
incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
consider only unidirectional data transfer
- but control info will flow on both directions!
use finite state machines (FSM) to specify sender, receiver
underlying channel perfectly reliable
- no bit errors
- no loss of packets
separate FSMs for sender, receiver:
- sender sends data into underlying channel
- receiver reads data from underlying channel
rdt2.0: channel with bit errors(经具有比特差错信道的可靠数据传输)
- underlying channel may flip bits in packet
- checksum to detect bit errors
the question: how to recover from errors:
- acknowledgements(ACKs, 肯定确认): receiver explicitly tells sender that pkt received OK
- negative acknowledgements(NAKs, 否定确认): receiver explicitly tells sender that pkt had errors
- sender retransmits(重传) pkt on receipt of NAK
Automatic Repeat reQuest (ARQ) protocols
new mechanisms in rdt2.0 (beyond rdt1.0):
- error detection(差错检测)
- receiver feedback(接收方反馈): control msgs (ACK,NAK) rcvr->sender
stop-and-wait protocol
rdt2.0 has a fatal flaw
- what happens if ACK/NAK corrupted?
- sender doesn’t know what happened at receiver
- can’t just retransmit: possible duplicate
duplicate packets
handling duplicates:
- sender retransmits current pkt if ACK/NAK corrupted
- sender adds sequence number to each pkt
- receiver discards (doesn’t deliver up) duplicate pkt
sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
- sender:
- seq # added to pkt
- two seq. #’s (0,1) will suffice.
- must check if received ACK/NAK corrupted
- twice as many states
- state must “remember” whether “expected” pkt should have seq # of 0 or 1
receiver:
- must check if received packet is duplicate
- state indicates whether 0 or 1 is expected pkt seq #
- note: receiver can not know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
- same functionality as rdt2.1, using ACKs only
- instead of NAK, receiver sends ACK for last pkt received OK
- receiver must explicitly include seq # of pkt being ACKed
- duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt3.0: channels with errors and loss(经具有比特差错的丢包信道的可靠数据传输)
new assumption: underlying channel can also lose packets (data, ACKs)
- checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits “reasonable” amount of time for ACK
- retransmits if no ACK received in this time
- if pkt (or ACK) just delayed (not lost):
- retransmission will be duplicate, but seq. #s already handle this
- receiver must specify seq # of pkt being ACKed
- requires countdown timer
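A sketch of how an rdt3.0-style stop-and-wait sender could look over UDP, with the countdown timer modelled by setSoTimeout(); the receiver address/port, the 1-byte sequence header, and the 1-byte ACK format are assumptions for illustration.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class StopAndWaitSender {
    public static void main(String[] args) throws Exception {
        InetAddress receiver = InetAddress.getByName("localhost");
        int port = 9999;
        byte[][] messages = { "pkt one".getBytes(), "pkt two".getBytes() };

        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(1000);                 // "reasonable" wait for ACK, in ms
            int seq = 0;
            for (byte[] msg : messages) {
                byte[] pkt = new byte[msg.length + 1];
                pkt[0] = (byte) seq;                   // sequence number 0 or 1
                System.arraycopy(msg, 0, pkt, 1, msg.length);

                boolean acked = false;
                while (!acked) {
                    socket.send(new DatagramPacket(pkt, pkt.length, receiver, port));
                    try {
                        byte[] buf = new byte[1];
                        DatagramPacket ack = new DatagramPacket(buf, buf.length);
                        socket.receive(ack);           // wait for ACK...
                        acked = (buf[0] == seq);       // a wrong/duplicate ACK leaves acked false, so we retransmit
                    } catch (SocketTimeoutException e) {
                        System.out.println("timeout, retransmitting seq " + seq);
                    }
                }
                seq = 1 - seq;                         // alternate 0/1
            }
        }
    }
}
```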
rdt3.0 in action
Performance of rdt3.0
rdt3.0: stop-and-wait operation(停等)
- rdt3.0 is correct, but performance stinks
- e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
- $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
RTT = 30ms
utilization ($U_{sender}$): fraction of time sender is busy sending
- $ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 $
- 33kB/sec thruput over 1 Gbps link
- network protocol limits use of physical resources
Pipelined protocols(流水线可靠数据传输协议)
- pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver
two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
- 3-packet pipelining increases utilization(利用率) by a factor of 3
$ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.00081 $
Go-back-N(GBN, 回退N步):
- sender can have up to N unacked packets in pipeline
- receiver only sends cumulative ack
- doesn’t ack packet if there’s a gap
- sender has timer for oldest unacked packet
- when timer expires, retransmit all unacked packets
Selective Repeat(SR, 选择重传):
- sender can have up to N unack’ed packets in pipeline
- receiver sends individual ack for each packet
- sender maintains timer for each unacked packet
- when timer expires, retransmit only that unacked packet
Go-Back-N
Sender
- k-bit seq # in pkt header
- “window” of up to N, consecutive unack’ed pkts allowed
- ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
- may receive duplicate ACKs (see receiver)
- timer for oldest in-flight pkt
- timeout(n): retransmit packet n and all higher seq # pkts in window
- window size: N
- sliding-window protocol
sender extended FSM
receiver extended FSM
- ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
- may generate duplicate ACKs
- need only remember expectedseqnum
- out-of-order pkt:
- discard (don’t buffer): no receiver buffering
- re-ACK pkt with highest in-order seq #
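The sender-side bookkeeping can be sketched as a small in-memory toy (no channel, checksum, or real timer; the onAck/onTimeout hooks and window size N = 4 are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class GbnSender {
    static final int N = 4;                         // window size
    int base = 0, nextSeqNum = 0;
    final List<String> sent = new ArrayList<>();    // buffered copies for retransmission

    void rdtSend(String data) {
        if (nextSeqNum < base + N) {                // room in window?
            sent.add(data);
            System.out.println("send pkt " + nextSeqNum);
            nextSeqNum++;                           // a real sender would also (re)start the timer here
        } else {
            System.out.println("window full, refuse data: " + data);
        }
    }

    void onAck(int n) {                             // cumulative ACK: everything up to n is ACKed
        base = n + 1;
        System.out.println("ACK " + n + ", window base now " + base);
    }

    void onTimeout() {                              // resend all unACKed packets
        for (int i = base; i < nextSeqNum; i++) {
            System.out.println("retransmit pkt " + i + ": " + sent.get(i));
        }
    }

    public static void main(String[] args) {
        GbnSender s = new GbnSender();
        for (int i = 0; i < 5; i++) s.rdtSend("data" + i);  // 5th is refused (window = 4)
        s.onAck(1);                                         // slides base to 2
        s.rdtSend("data4");                                 // now fits in the window
        s.onTimeout();                                      // resends pkts 2..4
    }
}
```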
GBN in action
- cumulative acknowledgment
Selective repeat
- receiver individually acknowledges all correctly received pkts
- buffers pkts, as needed, for eventual in-order delivery to upper layer
- sender only resends pkts for which ACK not received
- sender timer for each unACKed pkt
- sender window
- N consecutive seq #’s
- limits seq #s of sent, unACKed pkts
sender, receiver windows:
- sender
- data from above:
- if next available seq # in window, send pkt
- timeout(n):
- resend pkt n, restart timer
- ACK(n) in [sendbase,sendbase+N]:
- mark pkt n as received
- if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
- pkt n in [rcvbase, rcvbase+N-1]
- send ACK(n)
- out-of-order: buffer
- in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
- pkt n in [rcvbase-N,rcvbase-1]
- ACK(n)
- otherwise: ignore
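A toy sketch of this receiver logic (window size N = 4 and the onPacket hook are assumptions; ACKs and delivery are just prints):

```java
import java.util.HashMap;
import java.util.Map;

public class SrReceiver {
    static final int N = 4;                          // window size
    int rcvBase = 0;
    final Map<Integer, String> buffer = new HashMap<>();

    void onPacket(int seq, String data) {
        if (seq >= rcvBase && seq <= rcvBase + N - 1) {
            System.out.println("ACK " + seq);
            buffer.put(seq, data);                   // out-of-order: buffer
            while (buffer.containsKey(rcvBase)) {    // in-order: deliver run of buffered pkts
                System.out.println("deliver " + buffer.remove(rcvBase));
                rcvBase++;                           // advance window
            }
        } else if (seq >= rcvBase - N && seq < rcvBase) {
            System.out.println("duplicate, re-ACK " + seq);   // ACK must be regenerated
        }                                            // otherwise: ignore
    }

    public static void main(String[] args) {
        SrReceiver r = new SrReceiver();
        r.onPacket(1, "pkt1");   // out of order: ACKed and buffered
        r.onPacket(0, "pkt0");   // fills the gap: delivers pkt0, pkt1; rcvBase -> 2
        r.onPacket(0, "pkt0");   // below the window: duplicate, re-ACK 0
    }
}
```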
Selective repeat in action
Selective repeat: dilemma
- example:
- seq #’s: 0, 1, 2, 3
- window size=3
receiver sees no difference in two scenarios, duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
in SR, the window size must be less than or equal to half the size of the sequence number space
connection-oriented transport: TCP(面向连接的传输: TCP)
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no “message boundaries”
- pipelined: TCP congestion and flow control set window size
- full duplex data:
- bi-directional data flow in same connection
- MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
flow controlled: sender will not overwhelm receiver
stream: no notion of message boundaries
MSS: Maximum Segment Size
MTU (Maximum Transmission Unit): maximum link-layer frame size
TCP segment structure(TCP报文段结构)
- sequence numbers(序号字段): byte stream “number” of first byte in segment’s data
- acknowledgements(确认号字段):
- seq # of next byte expected from other side
- cumulative ACK
- Q: how receiver handles out-of-order segments
A: TCP spec doesn’t say; up to implementor
receive window field: used for flow control; indicates the number of bytes the receiver is willing to accept
- Q: how to set TCP timeout value?
- longer than RTT but RTT varies
- too short: premature timeout, unnecessary retransmissions
too long: slow reaction to segment loss
Q: how to estimate RTT(估计往返时间)?
- SampleRTT: measured time from segment transmission until ACK receipt
- ignore retransmissions
- SampleRTT will vary, want estimated RTT “smoother”
- average several recent measurements, not just current SampleRTT
- $EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
- exponential weighted moving average(EWMA, 指数加权移动平均)
- influence of past sample decreases exponentially fast
- typical value: $\alpha = 0.125$
- timeout interval: EstimatedRTT plus “safety margin”
large variation in EstimatedRTT -> larger safety margin
estimate SampleRTT deviation from EstimatedRTT:
$$
DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT| \qquad (\text{typically, } \beta = 0.25)
$$
$$
TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT
$$
(the estimated RTT plus a $4 \cdot DevRTT$ “safety margin”)
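Putting the three formulas together as a small sketch (alpha = 0.125, beta = 0.25; the initial values and RTT samples are made up):

```java
public class RttEstimator {
    static final double ALPHA = 0.125, BETA = 0.25;
    double estimatedRtt = 100.0;   // ms, some initial value
    double devRtt = 0.0;

    void onSample(double sampleRtt) {        // one SampleRTT per measured (non-retransmitted) segment
        estimatedRtt = (1 - ALPHA) * estimatedRtt + ALPHA * sampleRtt;
        devRtt = (1 - BETA) * devRtt + BETA * Math.abs(sampleRtt - estimatedRtt);
    }

    double timeoutInterval() {
        return estimatedRtt + 4 * devRtt;    // EstimatedRTT plus "safety margin"
    }

    public static void main(String[] args) {
        RttEstimator e = new RttEstimator();
        for (double s : new double[] {110, 95, 130, 105}) {
            e.onSample(s);
            System.out.printf("sample=%.0f estimated=%.1f dev=%.1f timeout=%.1f%n",
                    s, e.estimatedRtt, e.devRtt, e.timeoutInterval());
        }
    }
}
```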
TCP reliable data transfer(可靠数据传输)
- TCP creates rdt service on top of IP’s unreliable service
- pipelined segments
- cumulative acks
- single retransmission timer
- retransmissions triggered by:
- timeout events
- duplicate acks
let’s initially consider simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control
TCP sender events:
- data received from app:
- create segment with seq #
- seq # is byte-stream number of first data byte in segment
- start timer if not already running
- think of timer as for oldest unacked segment
- expiration interval: TimeOutInterval
timeout:
- retransmit segment that caused timeout
- restart timer
- ack received: if ack acknowledges previously unacked segments
- update what is known to be ACKed
- start timer if there are still unacked segments
TCP sender (simplified)
retransmission scenarios:
TCP ACK generation [RFC 1122, RFC 2581]
| event at receiver | TCP receiver action |
|---|---|
| arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed | delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK |
| arrival of in-order segment with expected seq #. One other segment has ACK pending | immediately send single cumulative ACK, ACKing both in-order segments |
| arrival of out-of-order segment with higher-than-expected seq #. Gap detected | immediately send duplicate ACK, indicating seq. # of next expected byte |
| arrival of segment that partially or completely fills gap | immediately send ACK, provided that segment starts at lower end of gap |
TCP fast retransmit(快速重传)
- time-out period often relatively long: long delay before resending lost packet
- detect lost segments via duplicate ACKs.
- sender often sends many segments back-to-back
- if segment is lost, there will likely be many duplicate ACKs.
if sender receives 3 dupl ACKs for same data(“triple duplicate ACKs”), resend unacked segment with smallest seq #
- likely that unacked segment lost, so don’t wait for timeout
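A sketch of the duplicate-ACK counting behind fast retransmit (the onAck hook and ACK numbers are made up; the fast-recovery reaction of real TCP is not shown):

```java
public class FastRetransmit {
    int lastAckSeen = -1;
    int dupAckCount = 0;

    void onAck(int ackNum) {
        if (ackNum == lastAckSeen) {
            dupAckCount++;
            if (dupAckCount == 3) {          // "triple duplicate ACKs"
                System.out.println("fast retransmit segment starting at " + ackNum);
            }
        } else {                             // new cumulative ACK: reset the counter
            lastAckSeen = ackNum;
            dupAckCount = 0;
        }
    }

    public static void main(String[] args) {
        FastRetransmit fr = new FastRetransmit();
        for (int ack : new int[] {100, 200, 200, 200, 200}) fr.onAck(ack);
        // prints a fast retransmit for byte 200 on the third duplicate ACK
    }
}
```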
TCP flow control(TCP流量控制)
receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast
- receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
- RcvBuffer size set via socket options (typical default is 4096 bytes)
- many operating systems autoadjust RcvBuffer
- sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
- guarantees receive buffer will not overflow
Connection Management(TCP连接管理)
- before exchanging data, sender/receiver “handshake”:
- agree to establish connection (each knowing the other willing to establish connection)
- agree on connection parameters
- Q: will 2-way handshake always work in network?
- variable delays
- retransmitted messages (e.g. req_conn(x)) due to message loss
- message reordering
- can’t “see” other side
2-way handshake failure scenarios:
TCP 3-way handshake(三次握手)
TCP 3-way handshake: FSM
TCP: closing a connection(四次挥手)
- client, server each close their side of connection
- send TCP segment with FIN bit = 1
- respond to received FIN with ACK
- on receiving FIN, ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled
Principles of congestion control(拥塞控制原理)
- congestion:
- informally: “too many sources sending too much data too fast for network to handle”
- different from flow control!
- manifestations:
- lost packets (buffer overflow at routers)
- long delays (queueing in router buffers)
- a top-10 problem
Causes/costs of congestion: scenario
- two senders, two receivers
- one router, infinite buffers
- output link capacity: R
- no retransmission
- one router, finite buffers
- sender retransmission of timed-out packet
- application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
- transport-layer input includes retransmissions: $\lambda'_{in} \geq \lambda_{in}$
- idealization: perfect knowledge
- sender sends only when router buffers available
- idealization: known loss. Packets can be lost, dropped at router due to full buffers
- sender only resends if packet known to be lost
- Realistic: duplicates
- packets can be lost, dropped at router due to full buffers
- sender times out prematurely, sending two copies, both of which are delivered
- “costs” of congestion:
- more work (retrans) for given “goodput”
- unneeded retransmissions: link carries multiple copies of pkt
- decreasing goodput
- four senders
- multihop paths
- timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
- A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput $\to$ 0
- another “cost” of congestion:
- when packet dropped, any upstream transmission capacity used for that packet was wasted
TCP congestion control: additive increase multiplicative decrease(AIMD, 加性增,乘性减)
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
sender limits transmission: $LastByteSent - LastByteAcked \leq cwnd$
cwnd (congestion window) is dynamic, a function of perceived network congestion
TCP sending rate:
- roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
- $rate \approx \frac{cwnd}{RTT}\ \text{bytes/sec}$
TCP Slow Start(慢启动)
- when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received
- summary: initial rate is slow but ramps up exponentially fast
detecting, reacting to loss
- loss indicated by timeout:
- cwnd set to 1 MSS;
- window then grows exponentially (as in slow start) to threshold, then grows linearly (entering congestion avoidance)
loss indicated by 3 duplicate ACKs: TCP Reno (enters fast recovery)
- dup ACKs indicate network capable of delivering some segments
- cwnd is cut in half; window then grows linearly
TCP Tahoe always sets cwnd to 1 MSS (whether timeout or 3 duplicate ACKs) and re-enters slow start
switching from slow start to CA (Congestion Avoidance)
- Q: when should the exponential increase switch to linear?
- A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
- variable ssthresh
- on loss event, ssthresh is set to 1/2 of cwnd just before loss event
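A toy simulation of the cwnd evolution described above, with a Tahoe-style reaction to loss (the initial ssthresh, the round in which the timeout occurs, and the MSS-unit bookkeeping are arbitrary choices for illustration):

```java
public class CwndSimulation {
    public static void main(String[] args) {
        double cwnd = 1;          // in MSS
        double ssthresh = 64;     // some initial threshold, in MSS
        for (int rtt = 1; rtt <= 12; rtt++) {
            System.out.printf("RTT %2d: cwnd = %.0f MSS (ssthresh = %.1f)%n", rtt, cwnd, ssthresh);
            boolean loss = (rtt == 8);                 // pretend a timeout occurs here
            if (loss) {
                ssthresh = cwnd / 2;                   // remember half the window at loss
                cwnd = 1;                              // back to slow start
            } else if (cwnd < ssthresh) {
                cwnd = Math.min(cwnd * 2, ssthresh);   // slow start: exponential growth
            } else {
                cwnd = cwnd + 1;                       // congestion avoidance: +1 MSS per RTT
            }
        }
    }
}
```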
TCP throughput(TCP吞吐量)
- avg. TCP thruput as function of window size, RTT?
- ignore slow start, assume always data to send
W: window size (measured in bytes) where loss occurs
- avg. window size (# in-flight bytes) is 3/4 W
- avg. thruput is 3/4W per RTT
- $\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}$
TCP Futures: TCP over “long, fat pipes”(经高带宽路径的TCP)
- example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:
- $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$
- to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate (see the check after this list)
- new versions of TCP for high-speed
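A quick check of where these numbers come from, under the stated assumptions (1500-byte segments, 100 ms RTT, 10 Gbps target):

$$
W = \frac{10^{10}\ \text{bits/sec} \times 0.1\ \text{sec}}{1500 \times 8\ \text{bits}} \approx 83{,}333\ \text{segments}, \qquad
L = \left(\frac{1.22 \cdot MSS}{RTT \cdot \text{throughput}}\right)^2 = \left(\frac{1.22 \times 12000\ \text{bits}}{0.1\ \text{sec} \times 10^{10}\ \text{bits/sec}}\right)^2 \approx 2 \times 10^{-10}
$$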
TCP Fairness(TCP公平性)
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
Why is TCP fair?
- two competing sessions:
- additive increase gives slope of 1, as throughput increases
- multiplicative decrease decreases throughput proportionally
Fairness and UDP
- multimedia apps often do not use TCP
- do not want rate throttled by congestion control
- instead use UDP:
- send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
- application can open multiple parallel connections between two hosts
- web browsers do this
- e.g., link of rate R with 9 existing connections:
- new app asks for 1 TCP, gets rate R/10
- new app asks for 11 TCPs, gets 11R/20 ≈ R/2
Explicit Congestion Notification (ECN)
- network-assisted congestion control:
- two bits in IP header (ToS field) marked by network router to indicate congestion
- congestion indication carried to receiving host
- receiver (seeing congestion indication in IP datagram), sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion